Screen Capture of Github:
Screen Capture of File Structure location:
[lab name].Rproj (for the first lab this Lab_1.Rproj)
Screen Capture of RStudio:
Knit the document. Screen Capture of Knitting:
Screen Capture of resulting html file:
The readings are based on this philosophy.
Knit this R Markdown document into an html file (lab_week_1_part2.html). This is the file to turn in on Canvas.
Below are four vectors of data:
n<-15
a<-LETTERS[sample(n,replace=TRUE)]
b<-rpois(n,1)
c<-rep(TRUE,n)
d<-rbinom(n,1,.25)
In the R window below, create a list, a matrix, and a data.frame object out of this data.
## Your R code here
Use tidyverse `filter` and `select` to choose only the rows with 1 in `d` and the columns `a` and `d`.
## Your R code here
lm objects
The classic linear model is of the form,
\[y_i = \beta_0+\sum_{j=1}^k \beta_j x_{ij}\] When using R to fit lm objects we use the formula environment.
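As a quick hedged illustration of the formula environment (using the built-in mtcars data, which is not part of the lab), the left-hand side of `~` is the response and the right-hand side lists the predictors:

```r
## Sketch: R's formula interface for lm, using the built-in mtcars data
fit <- lm(mpg ~ wt + hp, data = mtcars)  # mpg modeled on weight and horsepower
coef(fit)  # intercept plus one slope per predictor
```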
Let’s grab an example data set from the excellent UCLA Statistics Consulting website.
For our data analysis below, we will use the crime dataset that appears in Statistical Methods for Social Sciences, Third Edition by Alan Agresti and Barbara Finlay (Prentice Hall, 1997). The variables are state id (sid), state name (state), violent crimes per 100,000 people (crime), murders per 1,000,000 (murder), the percent of the population living in metropolitan areas (pctmetro), the percent of the population that is white (pctwhite), percent of population with a high school education or above (pcths), percent of population living under poverty line (poverty), and percent of population that are single parents (single). It has 51 observations. We are going to use poverty and single to predict crime. - UCLA
## Why do I have foreign::read.dta? What is this?
cdata <- foreign::read.dta("https://stats.idre.ucla.edu/stat/data/crime.dta")
head(cdata)
sid state crime murder pctmetro pctwhite pcths poverty single
1 1 ak 761 9.0 41.8 75.2 86.6 9.1 14.3
2 2 al 780 11.6 67.4 73.5 66.9 17.4 11.5
3 3 ar 593 10.2 44.7 82.9 66.3 20.0 10.7
4 4 az 715 8.6 84.7 88.6 78.7 15.4 12.1
5 5 ca 1078 13.1 96.7 79.3 76.2 18.2 12.5
6 6 co 567 5.8 81.8 92.5 84.4 9.9 12.1
ggplot(data=cdata, aes(x=poverty, y=crime, group=1)) +
geom_line()+
geom_point()
ggplot(data=cdata, aes(x=single, y=crime, group=1)) +
geom_line()+
geom_point()
summary(ols <- lm(crime ~ poverty + single, data = cdata))
Call:
lm(formula = crime ~ poverty + single, data = cdata)
Residuals:
Min 1Q Median 3Q Max
-811.14 -114.27 -22.44 121.86 689.82
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1368.189 187.205 -7.308 2.48e-09 ***
poverty 6.787 8.989 0.755 0.454
single 166.373 19.423 8.566 3.12e-11 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 243.6 on 48 degrees of freedom
Multiple R-squared: 0.7072, Adjusted R-squared: 0.695
F-statistic: 57.96 on 2 and 48 DF, p-value: 1.578e-13
opar <- par(mfrow = c(2,2), oma = c(0, 0, 1.1, 0))
plot(ols, las = 1)
From these plots, we can identify observations 9, 25, and 51 as possibly problematic to our model. We can look at these observations to see which states they represent.
cdata[c(9, 25, 51), 1:2]
sid state
9 9 fl
25 25 ms
51 51 dc
qqPlot(ols)
[1] 9 25
influencePlot(ols, id.n=3)
StudRes Hat CookD
9 3.1632792 0.04831771 0.14258909
25 -4.1109749 0.12668898 0.61387212
49 -0.6227287 0.17838219 0.02842712
51 2.7961066 0.53602150 2.63625193
d1 <- cooks.distance(ols)
r <- MASS::stdres(ols) ## here is package::function again!
a <- cbind(cdata, d1, r)
## Pretty Print
## How do we make this a function?
data.frame(lapply(a[d1 > 4/51, ], function(x){
  if(is.numeric(x)){
    return(round(x, 3))
  }
  x
}))
sid state crime murder pctmetro pctwhite pcths poverty single d1 r
1 1 ak 761 9.0 41.8 75.2 86.6 9.1 14.3 0.125 -1.397
2 9 fl 1206 8.9 93.0 83.5 74.4 17.8 10.6 0.143 2.903
3 25 ms 434 13.5 30.7 63.3 64.3 24.7 14.7 0.614 -3.563
4 51 dc 2922 78.5 100.0 31.8 73.1 26.4 22.1 2.636 2.616
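One possible answer to the "How do we make this a function?" question above (a hedged sketch, not the text's own solution):

```r
## Hypothetical helper: round every numeric column of a data frame,
## leaving non-numeric columns untouched
pretty_round <- function(df, digits = 3) {
  data.frame(lapply(df, function(x) if (is.numeric(x)) round(x, digits) else x))
}
pretty_round(data.frame(a = c(1.23456, 2.34567), b = c("x", "y")))
```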
Robust regression is done by iterated re-weighted least squares (IRLS). The command for running robust regression is rlm in the MASS package. There are several weighting functions that can be used for IRLS. We are going to first use the Huber weights in this example. We will then look at the final weights created by the IRLS process. This can be very useful.
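Before running rlm, the Huber weighting idea can be sketched directly (a hedged illustration; k = 1.345 is the default tuning constant for MASS's Huber psi):

```r
## Sketch of the Huber weight used by IRLS: observations with scaled
## residual |e| <= k get full weight 1; larger residuals get weight k/|e|
huber_weight <- function(e, k = 1.345) ifelse(abs(e) <= k, 1, k / abs(e))
huber_weight(c(0.5, 1.0, 3.0))  # small residuals keep weight 1; |e| = 3 is down-weighted
```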
summary(rr.huber <- MASS::rlm(crime ~ poverty + single, data = cdata))
Call: rlm(formula = crime ~ poverty + single, data = cdata)
Residuals:
Min 1Q Median 3Q Max
-846.09 -125.80 -16.49 119.15 679.94
Coefficients:
Value Std. Error t value
(Intercept) -1423.0373 167.5899 -8.4912
poverty 8.8677 8.0467 1.1020
single 168.9858 17.3878 9.7186
Residual standard error: 181.8 on 48 degrees of freedom
hweights <- data.frame(state = cdata$state, resid = rr.huber$resid, weight = rr.huber$w)
hweights2 <- hweights[order(rr.huber$w), ]
hweights2[1:15, ]
state resid weight
25 ms -846.08536 0.2889618
9 fl 679.94327 0.3595480
46 vt -410.48310 0.5955740
51 dc 376.34468 0.6494131
26 mt -356.13760 0.6864625
21 me -337.09622 0.7252263
31 nj 331.11603 0.7383578
14 il 319.10036 0.7661169
1 ak -313.15532 0.7807432
20 md 307.19142 0.7958154
19 ma 291.20817 0.8395172
18 la -266.95752 0.9159411
2 al 105.40319 1.0000000
3 ar 30.53589 1.0000000
4 az -43.25299 1.0000000
We can see that, roughly, as the absolute residual goes down, the weight goes up. In other words, cases with large residuals tend to be down-weighted. This output shows us that the observation for Mississippi will be down-weighted the most. Florida will also be substantially down-weighted. All observations not shown above have a weight of 1. In OLS regression, all cases have a weight of 1. Hence, the more cases in the robust regression that have a weight close to one, the closer the results of the OLS and robust regressions.
clse<-multiwayvcov::cluster.vcov(ols,cdata) ## cluster.vcov comes from the multiwayvcov package
cat('Without Clustered SE\n')
Without Clustered SE
summary(ols)
Call:
lm(formula = crime ~ poverty + single, data = cdata)
Residuals:
Min 1Q Median 3Q Max
-811.14 -114.27 -22.44 121.86 689.82
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1368.189 187.205 -7.308 2.48e-09 ***
poverty 6.787 8.989 0.755 0.454
single 166.373 19.423 8.566 3.12e-11 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 243.6 on 48 degrees of freedom
Multiple R-squared: 0.7072, Adjusted R-squared: 0.695
F-statistic: 57.96 on 2 and 48 DF, p-value: 1.578e-13
cat('\nWith Clustered SE\n')
With Clustered SE
coeftest(ols,clse)
t test of coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1368.1887 215.3707 -6.3527 7.235e-08 ***
poverty 6.7874 10.0913 0.6726 0.5044
single 166.3727 22.9567 7.2472 3.075e-09 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The chapter on Nonlinear Regression in R by Fox & Weisberg is available online, here.
We can generalize the basic regression formula so that \(y\) is a function of \(m(x,\theta)\) and some error term \(\epsilon\). \(m\) can now be an arbitrary function, so our model is \[y = m(x,\theta)+\epsilon\] Fox & Weisberg begin by considering a logistic growth model for \(m\).
\[ y = \frac{\theta_1}{1+\exp[-(\theta_2+\theta_3 x)]}+\epsilon\]
A good first step toward understanding the proposed logistic growth model is to use the curve function in R.
theta<-c(1,1,1)
curve(theta[1]/(1+exp(-(-theta[2] + theta[3]*x))), from=-5, to=5, main="(b)")
abline(h=1/2, v=1, lty=2)
theta<-c(1,1,2)
curve(theta[1]/(1+exp(-(-theta[2] + theta[3]*x))), from=-5, to=5, main="(b)")
abline(h=1/2, v=1, lty=2)
theta<-c(1,1,10)
curve(theta[1]/(1+exp(-(-theta[2] + theta[3]*x))), from=-5, to=5, main="(b)")
abline(h=1/2, v=1, lty=2)
Try various \(\theta\) values and save them as extra tabs.
theta<-c(1,1,1)
curve(theta[1]/(1+exp(-(-theta[2] + theta[3]*x))), from=-5, to=5, main="(b)")
abline(h=1/2, v=1, lty=2)
Changing the values of the parameters \(\theta = (\theta_1,\theta_2,\theta_3)\) stretches or shrinks the axes, and changes the rate at which the curve varies from its lower value at 0 to its maximum value.
Stylized Facts from Fox & Weisberg
If \(\theta_3 > 0\), then as \(x\) gets larger the term \(\exp[-(\theta_2+\theta_3 x)]\) gets closer to 0, and so \(m(x, \theta)\) will approach the value \(\theta_1\) as an asymptote.
Assuming logistic population growth therefore imposes a limit to population size.
If \(\theta_3 > 0\), as \(x\rightarrow -\infty\), the term \(\exp[-(\theta_2+\theta_3 x)]\) grows large without bound and so the mean will approach 0.
The logistic growth curve is symmetric about the value of \(x\) for which \(m(x, \theta)\) is midway between 0 and \(\theta_1\).
It is not hard to show that \(m(x = -\frac{\theta_2}{\theta_3}, \theta) = \frac{\theta_1}{2}\), and so the curve is symmetric about the midpoint \(x = -\frac{\theta_2}{\theta_3}\).
The parameter \(\theta_3\) controls how quickly the curve transitions from the lower asymptote of 0 to the upper asymptote at \(\theta_1\), and is therefore a growth-rate parameter.
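The symmetry claim above is easy to verify numerically (a hedged sketch with arbitrary illustrative \(\theta\) values):

```r
## Numeric check that m(-theta2/theta3, theta) = theta1/2 for the logistic curve
m <- function(x, theta) theta[1] / (1 + exp(-(theta[2] + theta[3] * x)))
theta <- c(5, 2, 0.7)            # arbitrary illustrative values
m(-theta[2] / theta[3], theta)   # the midpoint value theta1/2 = 2.5
```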
R has a function for fitting nonlinear regression, nls. Fox & Weisberg formalize the nonlinear regression model as follows,
\[ y = E[y|x]+\epsilon = m(x,\theta)+ \epsilon\]
This makes the assumption that \(E[y|x]\) (read as the expectation of \(y\) conditioned on \(x\)) depends on \(x\) through the *kernel* function \(m(x,\theta)\).
Typically, there are also assumptions made on \(\epsilon\).
Under these assumptions the nls function in R can be used to estimate \(\theta\) as the values that minimize the residual sum of squares.
\[S(\theta) = \sum w[y-m(\theta,x)]^2\]
Let’s consider the case where \(w=1\) and \(m(x,\theta)= \beta_0+\beta_1x\), then \[S(\beta) = \sum [y-(\beta_0+\beta_1x)]^2\]
- To find the \(\theta\) of interest we simply want \(\arg \min_{\theta} \left\{ \sum [y-(\beta_0+\beta_1x)]^2 \right\}\)
  - This is an optimization problem!
- Notice that there is no requirement that minimizing \(S(\theta)\) have a closed-form solution. In fact this will rarely be the case.
- This will generally be solvable only numerically, by an iterative algorithm such as Gauss-Newton, which uses derivatives to solve this optimization problem.
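The point that \(S(\theta)\) is usually minimized numerically can be illustrated with optim (a hedged sketch on synthetic data, where the closed-form OLS answer gives us a check):

```r
## Minimize S(beta) numerically and compare to the closed-form OLS fit;
## the data are synthetic, so we know the target slope and intercept
set.seed(1)
x <- 1:20
y <- 2 + 3 * x + rnorm(20)
S <- function(beta) sum((y - (beta[1] + beta[2] * x))^2)
opt <- optim(c(0, 0), S)                 # Nelder-Mead on the residual sum of squares
cbind(optim = opt$par, ols = unname(coef(lm(y ~ x))))  # the two columns agree
```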
Fitting the general nls model requires a bit more hand-holding than the usual ‘lm’ case.
Let’s look at the nls function
args(nls)
function (formula, data = parent.frame(), start, control = nls.control(),
algorithm = c("default", "plinear", "port"), trace = FALSE,
subset, weights, na.action, model = FALSE, lower = -Inf,
upper = Inf, ...)
NULL
How do we find initial conditions? Typically you need to work with the proposed function; in the case of the growth model under consideration: \[ \begin{aligned} y \approx \frac{\theta_1}{1+\exp[-(\theta_2+\theta_3 x)]}\\ \frac{y}{\theta_1} \approx \frac{1}{1+\exp[-(\theta_2+\theta_3 x)]}\\ \frac{\theta_1}{y} \approx 1+\exp[-(\theta_2+\theta_3 x)]\\ \frac{\theta_1}{y}-1 \approx \exp[-(\theta_2+\theta_3 x)] \\ \frac{1-\frac{y}{\theta_1}}{\frac{y}{\theta_1}} \approx \exp[-(\theta_2+\theta_3 x)]\\ \frac{\frac{y}{\theta_1}}{1-\frac{y}{\theta_1}} \approx \exp[\theta_2+\theta_3 x]\\ \log\left( \frac{\frac{y}{\theta_1}}{1-\frac{y}{\theta_1}} \right) \approx \theta_2+\theta_3 x\\ \operatorname{logit}\left( \frac{y}{\theta_1} \right) \approx \theta_2+\theta_3 x \end{aligned} \]
This type of mathematical manipulation can be used to find good starting parameters. Basically, the goal is to come up with something that can be estimated with an OLS regression to get initial values (there are other strategies, but this is a good one).
Fox & Weisberg have a data set of US population by year and fit an NLS model to it. Let’s try!
data("USPop")
plot(population ~ year, data=USPop, main="")
abline(lm(population ~ year, data=USPop))
Based on our derivation, we can use \(logit\left( \frac{y}{\theta_1} \right) \approx \theta_2+\theta_3 x\) to get our starting conditions. Fox & Weisberg pick 400 as a starting parameter for \(\theta_1\) (this is based on the US population of roughly 308 million).
summary(ols<-lm(logit(population/400) ~ year, USPop))
Call:
lm(formula = logit(population/400) ~ year, data = USPop)
Residuals:
Min 1Q Median 3Q Max
-0.23258 -0.11036 -0.02292 0.12850 0.19658
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -4.925e+01 8.757e-01 -56.24 <2e-16 ***
year 2.507e-02 4.618e-04 54.27 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.1374 on 20 degrees of freedom
Multiple R-squared: 0.9933, Adjusted R-squared: 0.9929
F-statistic: 2946 on 1 and 20 DF, p-value: < 2.2e-16
theta0<-c(400,coef(ols))
pop.mod <- nls(population ~ theta1/(1 + exp(-(theta2 + theta3*year))),
start=list(theta1 = theta0[1], theta2 = theta0[2], theta3 = theta0[3]),
data=USPop, trace=TRUE)
summary(pop.mod)
Formula: population ~ theta1/(1 + exp(-(theta2 + theta3 * year)))
Parameters:
Estimate Std. Error t value Pr(>|t|)
theta1 440.833286 35.000130 12.60 1.14e-10 ***
theta2 -42.706981 1.839138 -23.22 2.08e-15 ***
theta3 0.021606 0.001007 21.45 8.87e-15 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 4.909 on 19 degrees of freedom
Number of iterations to convergence: 6
Achieved convergence tolerance: 1.965e-06
We can compute the CI straightforwardly,
Confint(pop.mod)
(likelihood-profiling trace omitted: because pop.mod was fit with trace=TRUE, Confint prints every re-fit made while profiling)
Estimate 2.5% 97.5%
theta1 440.83328575 381.49365542 536.93859220
theta2 -42.70698109 -46.58630995 -39.11083492
theta3 0.02160591 0.01961309 0.02371883
Where \(\hat{\sigma} = \sqrt{S(\hat{\theta})/(n-k)}\). Let’s do this calculation in R!
S<-function(theta=c(1,1,1)){
## My code here!
return(1)
}
n<-NROW(USPop)
k<-3
sigma_hat<-sqrt(S(c(1,1,1))/(n-k))
sigma_hat
[1] 0.2294157
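For reference, here is one hedged sketch of what \(S(\theta)\) could look like for the logistic growth model (filling the stub is the exercise; this is just one possible shape). The x and y vectors below are synthetic, so at the generating \(\theta\) the sum of squares is exactly zero:

```r
## Residual sum of squares for the logistic growth model m(x, theta)
S <- function(theta, x, y) {
  m <- theta[1] / (1 + exp(-(theta[2] + theta[3] * x)))
  sum((y - m)^2)
}
x <- seq(1790, 2010, by = 10)
y <- 440.8 / (1 + exp(-(-42.71 + 0.0216 * x)))  # noiseless synthetic logistic data
S(c(440.8, -42.71, 0.0216), x, y)               # 0 at the generating theta
```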
We can get the SE for this ratio by the delta method (calculus approximation):
deltaMethod(pop.mod, "-theta2/theta3")
Estimate SE 2.5 % 97.5 %
-theta2/theta3 1976.6341 7.5558 1961.8250 1991.4
Which gives an SE of about 7.5 years.
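As a sanity check (hedged, using the coefficient estimates reported by summary(pop.mod) above), the point estimate of the ratio can be computed by hand:

```r
## The ratio -theta2/theta3 is the x value at the curve's inflection point
theta2 <- -42.70698; theta3 <- 0.02160591  # estimates from summary(pop.mod)
-theta2 / theta3                           # ~1976.6, matching deltaMethod's estimate
```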
par(mfrow=c(1,2))
plot(population ~ year, USPop, xlim=c(1790, 2100), ylim=c(0, 450))
with(USPop, lines(seq(1790, 2100, by=10),
predict(pop.mod, data.frame(year=seq(1790, 2100, by=10))), lwd=2))
points(2010, 308.745538, pch=16, cex=1.3)
abline(h=0, lty=2)
abline(h=coef(pop.mod)[1], lty=2)
abline(h=0.5*coef(pop.mod)[1], lty=2)
abline(v= -coef(pop.mod)[2]/coef(pop.mod)[3], lty=2)
with(USPop, plot(year, residuals(pop.mod), type='b'))
abline(h=0, lty=2)
Let’s check the quality of the predictions from Fox & Weisberg for 2000 to 2019. We can grab recent population data from Wikipedia’s Demographics of the United States page.
wiki<-"https://en.wikipedia.org/wiki/Demographics_of_the_United_States" %>% read_html() %>% html_table(fill=TRUE, header=TRUE)
us_pop<-wiki[[8]][,1:2]
names(us_pop)<-c("Year","Pop")
us_pop$Pop_num<-as.numeric(gsub(",","",us_pop$Pop))/1000000 ## Normalize to Millions
years01_19<-us_pop$Year[us_pop$Year%in%2001:2019] ## keep predictions aligned with the observed years
uspop01_19<-us_pop %>% filter(Year%in%2001:2019)%>%select(Pop_num)
p_uspop01_19<-predict(pop.mod, data.frame(year=years01_19)) ## one prediction per observed year
data.frame(Year=years01_19,TruePop=uspop01_19,PredPop=p_uspop01_19,diff=uspop01_19-p_uspop01_19)
Compute the NLS and LM models for the population of Seattle and California. Fit the models on data up through 2000 and predict 2001 to the present. Which performs better, the NLS growth model or the linear model?
wiki<-"https://en.wikipedia.org/wiki/Seattle" %>% read_html() %>% html_table(fill=TRUE, header=TRUE)
seattle_pop<-wiki[[3]]
seattle_pop<-seattle_pop[2:18,1:2]
names(seattle_pop)<-c("year","pop")
seattle_pop$year<-as.integer(gsub("Est.","",seattle_pop$year))
seattle_pop$pop_numeric<-as.numeric(gsub(",","",seattle_pop$pop))
seattle_pop
year pop pop_numeric
2 1860 188 188
3 1870 1,107 1107
4 1880 3,533 3533
5 1890 42,837 42837
6 1900 80,671 80671
7 1910 237,194 237194
8 1920 315,312 315312
9 1930 365,583 365583
10 1940 368,302 368302
11 1950 467,591 467591
12 1960 557,087 557087
13 1970 530,831 530831
14 1980 493,846 493846
15 1990 516,259 516259
16 2000 563,374 563374
17 2010 608,660 608660
18 2018 744,955 744955
plot(pop_numeric~year,data=seattle_pop)
abline(lm(pop_numeric ~ year, data=seattle_pop))
wiki<-"https://en.wikipedia.org/wiki/Demographics_of_California" %>% read_html() %>% html_table(fill=TRUE, header=TRUE)
california_pop<-wiki[[8]]
california_pop<-california_pop[,1:2]
names(california_pop)<-c("year","pop")
california_pop$year<-as.integer(california_pop$year)
california_pop$pop_numeric<-as.numeric(gsub(",","",california_pop$pop))/1000
head(california_pop)
year pop pop_numeric
1 1909 2,282,000 2282
2 1910 2,406,000 2406
3 1911 2,534,000 2534
4 1912 2,668,000 2668
5 1913 2,811,000 2811
6 1914 2,934,000 2934
plot(log(pop_numeric)~year,data=california_pop)
abline(lm(log(pop_numeric) ~ year, data=california_pop))